Zillow Prize Competition EDA by Hadi Ramezani-Dakhel

About the competition:

Zillow’s Home Value Prediction competition is a two-round competition in which the ultimate goal is to improve the home-price prediction algotrithm of the Zillow company (aka Zestimate).

During the first round of the competition, the objective is to predict the Zestimate’s residual error. It means we need to predict where Zestimate fails and where it succeeds:

\[logerror=log(Zestimate)-log(SalePrice)\]

To make a successful predictive model our algorithm must be as good as Zillows’ algorithm (not better and not worse). In the second stage, however, the objective is to actually improve the home value prediction algorithm.

My goal is to do an EDA on the data set made available to us in the first round (https://www.kaggle.com/c/zillow-prize-1/data). First, let’s load some libraries.

Let’s read in the data first.

Let’s take a brief look at the data sets. The variables, their type, etc.

## 'data.frame':    90275 obs. of  61 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id_parcel               : int  10711738 10711755 10711805 10711816 10711858 10711910 10712086 10712162 10712163 10712195 ...
##  $ logerror                : num  0.0276 -0.0182 -0.1009 -0.0121 -0.0481 ...
##  $ date                    : Factor w/ 352 levels "2016-01-01","2016-01-02",..: 211 211 122 94 193 239 43 267 232 130 ...
##  $ type_aircon             : int  1 1 1 1 1 NA 1 1 1 1 ...
##  $ type_architectural      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_basement           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ num_bathroom            : num  3 3 2 2 2 2 2 3 3 3 ...
##  $ num_bedroom             : int  4 3 3 4 4 3 4 3 4 3 ...
##  $ type_framing            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ type_quality            : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ num_bathroom_calc       : num  3 3 2 2 2 2 2 3 3 3 ...
##  $ type_deck               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_firstfloor_finished: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_total_calc         : int  2538 1589 2411 2232 1882 1477 1850 3193 2421 1678 ...
##  $ area_live_finished      : int  2538 1589 2411 2232 1882 1477 1850 3193 2421 1678 ...
##  $ area_liveperi_finished  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_total_finished     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_unknown            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_base               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ fips                    : int  6037 6037 6037 6037 6037 6037 6037 6037 6037 6037 ...
##  $ num_fireplace           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ num_bath                : int  3 3 2 2 2 2 2 3 3 3 ...
##  $ num_garage              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_garage             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ flag_tub                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ type_heating            : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ latitude                : int  34220381 34222040 34220427 34222390 34222544 34221864 34226039 34226833 34226843 34223689 ...
##  $ longitude               : int  -118620802 -118622240 -118618549 -118618631 -118617961 -118615739 -118618527 -118612917 -118612422 -118612746 ...
##  $ area_lot                : num  11012 11010 11723 9002 9002 ...
##  $ num_pool                : int  1 1 1 NA 1 1 1 1 1 NA ...
##  $ area_pool               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ pooltypeid10            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ pooltypeid2             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ pooltypeid7             : int  1 1 1 NA 1 1 1 1 1 NA ...
##  $ zoning_landuse_county   : Factor w/ 78 levels "","0","010","0100",..: 5 5 5 4 5 5 5 5 5 4 ...
##  $ type_zoning_landuse     : int  261 261 261 261 261 261 261 261 261 261 ...
##  $ zoning_property         : Factor w/ 1997 levels "","1NR1*","1NR3*",..: 634 634 638 638 638 634 638 634 634 634 ...
##  $ rawcensustractandblock  : num  60371132 60371132 60371132 60371132 60371132 ...
##  $ region_city             : int  12447 12447 12447 12447 12447 12447 12447 12447 12447 12447 ...
##  $ region_county           : int  3101 3101 3101 3101 3101 3101 3101 3101 3101 3101 ...
##  $ region_neighbor         : int  268588 268588 268588 268588 268588 268588 268588 268588 268588 268588 ...
##  $ region_zip              : int  96339 96339 96339 96339 96339 96339 96339 96339 96339 96339 ...
##  $ num_room                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ type_story              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ num_75_bath             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ type_material           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ num_unit                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ area_patio              : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ area_shed               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ build_year              : int  1978 1959 1973 1973 1973 1960 1974 1964 1962 1961 ...
##  $ num_story               : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ flag_fireplace          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_building            : num  245180 254691 235114 262309 232037 ...
##  $ tax_total               : num  567112 459844 384787 437176 382055 ...
##  $ tax_year                : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ tax_land                : num  321932 205153 149673 174867 150018 ...
##  $ tax_property            : num  7219 6901 4877 5560 4878 ...
##  $ tax_delinquency         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_delinquency_year    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ censustractandblock     : num  6.04e+13 6.04e+13 6.04e+13 6.04e+13 6.04e+13 ...

The data set contains 60 variables and 90275 observations.

Univariate Plots Section

Our objective is to predict the residual error. So, let’s plot a histogram of that. I’ll plot all histograms in terms of percentages to better understand them. I’ll first do a summary to help me setup the graph.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.60500 -0.02530  0.00600  0.01146  0.03920  4.73700

The x-axis on the left plot is the logerror. The logerror actually shows the order of magnitude differences between the estimated values and the sale price. For instance, logerror= 1 indicates that Zestimate overestimated a property value by an order of magnitude, e.g. the sale price was $100,000 but Zestimate predicted $1000,000! Let’s get a summary of the logerror:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.60500 -0.02530  0.00600  0.01146  0.03920  4.73700

Mean and medians are both positive. This indicates that Zestimate on average tends to overestimate the house prices. The box_plot on the right also shows that we have some outliers that we should be very careful about.

Let’s split this over three different counties to see how the distribution looks like.

## [1] "1286" "2061" "3101"

While the distributions look similar, the code “3101” has longer tails. This could simply be because of higher number of transactions in that county. We’ll look at this again later.

I would like to get a overall idea of the size and the age of the buildings.

The size distribution is positively skewed which is not surprising. By scaling the x axis to log10, the histogram looks like a normal distribution. Although the values on the transformed graph spans over a single order of magnitude, I think this distribution would work better for making predictive models.

Let’s get a summary of the data as well.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       2    1184    1540    1773    2095   22741     661

The average size is 1773 sqft and the max size is 2273 sqft. These values make sense. However, min size is 2! This does not make sense. We should be careful with this when we try to fit a model to the data.

Let’s make a box plot here as well:

We have some outliers here as well.

Let’s make a similar plot with the tax_total (this is the assessed value of the properties for tax purposes). I would predict to observe a similar behavior.

And yes, the graphs meet my expectations. However, I see a shoulder on log10 transformed histogram of total_tax which is quite interesting.

Next, I’ll create a new feature called “age”. This feature contains the age of each building. I’ll then plot a histogram of age.

We don’t have too many new buildings in the data set. Let’s also get a summary of the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   30.00   47.00   48.47   64.00  132.00     756

The mean age of the building is ~48 years. The mean and medians are very close here.

Finally, let’s me make a bar plot of the type of heating systems.

Heating type 2 (central) is the most common heating system.

Univariate Analysis

What is the structure of your dataset?

The main data set contains transactions/logerror data for 90275 properties with 58 features. The features include information such as size, tax, number of rooms, built year as well as x,y coordinates and zone information.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the logerror which shows the order of magnitude deviation of the predicted price from the sales price.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think the coordinate of each property (x and y) and the density of properties in a region (this is not included in the dataset) are important features which could be related to the magnitude of logerror. I plan to plot the coordinates and logerror on a map in the following sections. Calculating the feature “density” is quite challenging and suitable for model building later on.

Did you create any new variables from existing variables in the dataset?

I created a new feature called “age” (the age of each properties expressed in number of years) to better show if a property is newly built or not. The existing feature “year_build” contains this information but I think it’s quite easier to comprehend the age rather than the year_build.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of the graphs (logerror, size, age) were not unexpected. The distribution of the size histogram was right skewed similar to tax. I log-transformed those distribution and the transformed graphs looked like a normal distribution. However, the tax_total showed a shoulder after the transformation.

Bivariate Plots Section

I would like to look at the correlation between different features. There are now 61 features in the dataset! We either have to select important features by intuition or find another criterion! Let’s do both; first, I’ll drop features with more than 50% of missing values.

## 'data.frame':    90275 obs. of  36 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id_parcel             : int  10711738 10711755 10711805 10711816 10711858 10711910 10712086 10712162 10712163 10712195 ...
##  $ logerror              : num  0.0276 -0.0182 -0.1009 -0.0121 -0.0481 ...
##  $ date                  : Factor w/ 352 levels "2016-01-01","2016-01-02",..: 211 211 122 94 193 239 43 267 232 130 ...
##  $ num_bathroom          : num  3 3 2 2 2 2 2 3 3 3 ...
##  $ num_bedroom           : int  4 3 3 4 4 3 4 3 4 3 ...
##  $ type_quality          : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ num_bathroom_calc     : num  3 3 2 2 2 2 2 3 3 3 ...
##  $ area_total_calc       : int  2538 1589 2411 2232 1882 1477 1850 3193 2421 1678 ...
##  $ area_live_finished    : int  2538 1589 2411 2232 1882 1477 1850 3193 2421 1678 ...
##  $ fips                  : int  6037 6037 6037 6037 6037 6037 6037 6037 6037 6037 ...
##  $ num_bath              : int  3 3 2 2 2 2 2 3 3 3 ...
##  $ flag_tub              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ type_heating          : Factor w/ 12 levels "1","2","6","7",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ latitude              : int  34220381 34222040 34220427 34222390 34222544 34221864 34226039 34226833 34226843 34223689 ...
##  $ longitude             : int  -118620802 -118622240 -118618549 -118618631 -118617961 -118615739 -118618527 -118612917 -118612422 -118612746 ...
##  $ area_lot              : num  11012 11010 11723 9002 9002 ...
##  $ zoning_landuse_county : Factor w/ 78 levels "","0","010","0100",..: 5 5 5 4 5 5 5 5 5 4 ...
##  $ type_zoning_landuse   : int  261 261 261 261 261 261 261 261 261 261 ...
##  $ zoning_property       : Factor w/ 1997 levels "","1NR1*","1NR3*",..: 634 634 638 638 638 634 638 634 634 634 ...
##  $ rawcensustractandblock: num  60371132 60371132 60371132 60371132 60371132 ...
##  $ region_city           : int  12447 12447 12447 12447 12447 12447 12447 12447 12447 12447 ...
##  $ region_county         : int  3101 3101 3101 3101 3101 3101 3101 3101 3101 3101 ...
##  $ region_zip            : int  96339 96339 96339 96339 96339 96339 96339 96339 96339 96339 ...
##  $ num_room              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ num_unit              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ build_year            : int  1978 1959 1973 1973 1973 1960 1974 1964 1962 1961 ...
##  $ flag_fireplace        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ tax_building          : num  245180 254691 235114 262309 232037 ...
##  $ tax_total             : num  567112 459844 384787 437176 382055 ...
##  $ tax_year              : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ tax_land              : num  321932 205153 149673 174867 150018 ...
##  $ tax_property          : num  7219 6901 4877 5560 4878 ...
##  $ tax_delinquency       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ censustractandblock   : num  6.04e+13 6.04e+13 6.04e+13 6.04e+13 6.04e+13 ...
##  $ age_year              : num  39 58 44 44 44 57 43 53 55 56 ...

This leaves us with 37 features. We can now select some features manually and make correlation plots.

First, let’s create the correlation graphs for logerror, num_bedroom, num_bathroom, num_room, and num_units.

We see that there is a strong correlation between the number of bathrooms and the number of bedroom in a property which makes sense. Let’s make a separate plot for this.

The correlation between the logerror and the number of bedrooms shows that the distribution of the logerror is pretty wide for the properties with 3-5 bedrooms. Let’s also plot this separately.

Obviously, the logerror is broader for 3 bedroom apartments.

Now, let’s look at the correlation between the features related to the size and also the age of the buildings.

There is a perfect correlation between the “area_total_calc” and “area_live_finished” with correlation coefficient of 1! These two features should be combined together to create a single feature when making a predictive model.

The correlation plots between the “logerror” and “area_total_calc”, “area_live_finished”, and “area_lot” shows the logerror is smaller for larger properties. Let’s plot logerror and area_total_calc in a separate graph.

Again, we should keep in mind that the logerror tells us about the order of magnitude of deviations.

Finally, let’s make a correlation plot for the tax related features.

We see that there is not a strong correlation between the tax features and the logerror. However, there is a strong correlation between the tax properties themselves which is not surprising.

Let’s explore the month in which the transactions occurred and the correlation with logerror. To do this, I will create a new feature “trans_month” that contains the months information (1, 2, 3, etc). The reasoning behind this comes from the fact that “season change” drives the real estate market. I want to see whether Zestimate takes this into account or not, i.e. is there a strong correlation between the transaction month and the logerror?

This histogram only tells us about the transaction data available to us. It contains no information about the total number of transactions in each month.

Now, let’s plot the average logerror vs transaction month.

Very interesting results. The average logerror decreases as we enter the spring (March, April, etc) and it rises once the summer is finished (September, October, etc). This could very much be a general and periodic trend. However, we don’t have the logerror information over several years to confirm this.

We could also show this using box plots but the plot above is much simpler.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The correlation plots shows that in general logerror is smaller for larger properties. This indicates that Zestimates, in general, predicts the order of magnitude of sale prices for larger properties more accurately. This, however, does not mean that the actual dollar amounts are predicted more accurately. For example, when the logerror is 0.1 for a $100,000 house, the prediction is off by ~$26,000. The deviation from the sale price is ~$260,000 for a $1,000,000 house.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a perfect linear relationship between “area_total_calc” and “area_live_finished”. The number of bathroom and bedroom in a property are also strongly correlated. The same is true for the building tax and land tax.

What was the strongest relationship you found?

While there hasn’t been a very strong correlation between different features and logerror, I found that the logerror is systematically influenced by seasons/months. The logerror is generally lower in spring and summer. The overall weak correlation between various features investigated in this section and my feature of interest (logerror), encourages me to look into other features (coordinates, etc) in multivariate plots section. It is very important to look at x, y, and logerror in a single plot (rather than in multiple bivariate plots). As such, this task will be completed in the next section.

Multivariate Plots Section

First, I would like to create a map and show x and y coordinates of each property using a circle. I also would like to color each circle based on the value of logerror. Here, I’ll show a fraction of the data (30%) which I’ll select randomly.

The yellow color dominates here which indicates that the logerror is very small in general. However, we can easily spot several blue (overestimated prices) and red (under-estimated prices) on the map. If we zoom in by a factor of 5-6 we’ll see quite an interesting behavior. The properties with large values of logerror are commonly more isolated. There are two potential theories on how the predictions are poor for those properties: 1) These properties are unique. For example, they are located in a special location (good or bad) which can potentially attract or repel buyers. The Zestimate prediction probably doesn’t take those factors into considerations 2) Zestimate takes advantage of the sale prices for other properties in close proximity of a property of interest. Accordingly, Zestimate fails to make an accurate prediction when there isn’t enough nearby data.

In the bivariate section, we saw that the logerror appears to be smaller for larger properties. This was true for both features related to the area (total area, area_live_finished) and tax related features. Let’s make a scatter plot of one feature from these two categories and color them based on the logerror values.

This plots shows we don’t have many extreme logerror for super-large and super-expensive properties. It also shows we have much less data points there. Therefore, I think “size/assessed tax value” do not heavily influence the logerror (this is also confirmed using the correlation plots). Let’s split this by region_county to see if anything changes.

Visually, it looks like we have more outliers in the county “3101”" similar to what we observed before in the univariate plots section.

Now, let me split this by type_quality feature.

I don’t see any distinct patterns in these plots.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The x,y coordinates appear to be quite important in predicting the feature of interest here (this is qualitative for now). Accurate predition of the logerror seems to be quite difficult without extensive feature creation/manipulations.

Were there any interesting or surprising interactions between features?

The map plot showed some interesting interactions with the logerror. It seems that to make a good model we must create some new features that takes the sale price of the nearest properties into account . This is supported by observation of large logerror for isolated properties, i.e. the properties with fewer number of neighbors.


Final Plots and Summary

Plot One

Description One

The distribution of logerror in three different counties appears to be normal. We see longer tails in county 3101. In general, Zestimates shows comparable performance in all these three counties of California.

Plot Two

Description Two

I think this plot is one of the most important findings of this EDA. The plot shows that Zestimate, on average, has the lowest logerrors in spring and summer. I believe the Zestimates’ algorithms likely takes the season change into account. However, it’s either difficult to predict the market during the fall and winter or their algorithms needs some improvements.

Plot Three: The Map Plot

A plot of the properties/logerror on a map

Description Three

This graphs contains a lot of information about the data. One can spend hours and hours on this graph and think about possible theories and new features to explain the logerror. I found that Zestimate commonly fails to accurately predict the price of the properties in less dense areas (this is a qualitative observation for now). Accordingly, I hypothesize that Zestimates relies on available sale data to make predictions.


Reflection

The training data set contains information on ~90,000 properties in three different countries of California. We are provided with 58 different features including the size related information (size of the properties, number of bedrooms, number of bathrooms, etc) as well as tax, and location properties. The ultimate goal is to predict the deviation of the Zillow’s model called Zestimate with the actual sale price. I first started my EDA by plotting some basic histograms to get a feel for the data. I looked at the histograms of logerror, size, and tax properties and performed some transformation of the data. Then, I made some correlation plots to explore the relationships between important features. I indeed found reasonable correlations between a number of features. However, there was not a strong correlation between different features and my feature of interest logerror. Next, I plotted a conditional summary (mean of logerror for each months), and I found an interesting correlation which showed the logerror is higher is fall and winter. I also plotted the x and y coordinates of each property on a map using the leaflet library which also revealed some interesting information.

One of the difficult task was splitting the histogram of logerror for different counties and showing the percentage values for each county instead of count. The first thing that came to my mind was using the facet_wrap in ggplot. Unfortuantely, it couldn’t get the percentage in each county using a simple facet_wrap. To accomplish this, I subset the data for each county.

My future plan is to make a predictive model and to participate in the Kaggle competition. The analysis that I performed here and the insights from the data are crucial for making a good model. I’ll create/combine features based on my EDA to make a predictive model.

References

The kaggle wesite, stackoverflow, r-bloggers, and other documentations were used in this EDA.